PHE publishes a range of information to https://www.gov.uk/phe including blogs, statistics and other publications.
This vignette introduces myScrapers, an internal PHE R package with tools to extract published public health information from GOV.UK, the PHE blog, NHS Digital and NICE.
The functions offered include:

- blog_type_extractor, which pulls a data frame of blog topics or authors
- phe_blog_scraper, which allows the user to retrieve the text of blogs by topic into a tidy data frame
- phe_pubs, which downloads an interactive table of PHE publications on GOV.UK
- get_nhsd_pubs, get_nhsd_header, get_nhsd_sheets and get_nhsd_metadata, which can be combined to identify published public health spreadsheets on the NHS Digital website
- get_nice_guidance, which downloads recommendations from NICE public health guidance as a single file

These tools are intended to support text mining and analysis of published content, to assist in developing outputs such as products and services catalogues, and to make it easier to identify public health content at NHS Digital.
The functions are assembled into an R package, currently available on GitHub at https://github.com/julianflowers/myScrapers.
The first step is to install the package. This can be achieved with the code below:
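A typical way to install a package hosted on GitHub is via devtools (or the lighter-weight remotes package); the chunk below is a sketch assuming the repository is public and devtools is available:

```r
# Install myScrapers from GitHub (one-off; requires the devtools package)
# install.packages("devtools")
devtools::install_github("julianflowers/myScrapers")

# Load the package for this session
library(myScrapers)
```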
Identified public health content is available from https://digital.nhs.uk/article/4375/Public-health, which returns a search listing spread over three pages.
We can set up the URLs to be scraped:
url <- "https://digital.nhs.uk/article/4375/Public-health"
url1 <- paste0(url ,"?p=", 2:3)
links <- c(url, url1)
links
#> [1] "https://digital.nhs.uk/article/4375/Public-health"
#> [2] "https://digital.nhs.uk/article/4375/Public-health?p=2"
#> [3] "https://digital.nhs.uk/article/4375/Public-health?p=3"

Then use the get_nhsd_pubs function to identify a list of relevant publications. We use purrr::map to apply the extraction function to each page.
ph_pubs <- map(links, function(x) get_nhsd_pubs(x))
test <- flatten(ph_pubs) %>% na.omit()
test[1:3]
#> [[1]]
#> [1] "https://digital.nhs.uk/catalogue/PUB30258"
#>
#> [[2]]
#> [1] "https://digital.nhs.uk/catalogue/PUB30256"
#>
#> [[3]]
#> [1] "https://digital.nhs.uk/catalogue/PUB30241"

Next, we can identify the links to spreadsheets from these pages, extract metadata and titles, and create a tabular output.
# Identify spreadsheets, headers and descriptions
xls <- map(test, function(x) get_nhsd_sheets(x))
xls1 <- map(xls, data.frame)
xls2 <- map_df(xls1, bind_rows)
colnames(xls2) <- c("xls", "url")
head <- map(test, function(x) get_nhsd_header(x))
head1 <- map(head, data.frame)
head2 <- map_df(head1, bind_rows)
colnames(head2) <- c("title", "url")
meta <- map(test, function(x) get_nhsd_metadata(x))
meta1 <- map(meta, data.frame)
meta2 <- map_df(meta1, bind_rows)
colnames(meta2) <- c("description", "url")
left_join(head2, meta2) %>%
left_join(xls2) %>%
mutate(xls = paste0("<a href='", xls, "'>xls</a>")) %>%
mutate(description = str_replace_all(description, "Summary", "")) %>%
DT::datatable(rownames = FALSE, escape = FALSE, filter = "top", extensions = 'Buttons', options = list(
columnDefs = list(list(autowidth = TRUE, width = '10%', targets = list(1, 2))),
scrollX = TRUE,
pageLength = 25,
dom = 'Bfrtip',
buttons = c('csv', 'excel', 'pdf')))
#> Joining, by = "url"
#> Joining, by = "url"

The URL for PHE blogs is https://publichealthmatters.blog.gov.uk.
We can obtain a table of blog topic categories or authors…
url <- "https://publichealthmatters.blog.gov.uk"
url1 <- paste0(url, "/page/", 2:73)
urls <- c(url, url1)
cats <- get_blog_categories(urls)
cats
#> [1] "duncan-selbie-friday-message" "hwb"
#> [3] "hp" "phes-science"
#> [5] "priority2" "local-authority-public-health"
#> [7] "health-improvement" "health-matters"
#> [9] "health-and-work" "disease-detectives"
#> [11] "data-blog" "digital"
#> [13] "uncategorized" "cko"
#> [15] "science-hub" "london-region"
#> [17] "priority3" "global-health"
#> [19] "northern-region" "health-profile-for-england"
#> [21] "priority5" "phe-announcement"
#> [23] "phe-people" "health-in-a-changing-climate"
#> [25] "health-visitors" "the-week-at-phe"
#> [27] "priority6" "nursing"
#> [29] "priority4" "midlands-and-east-of-england"
#> [31] "midwifery" "priority1"
#> [33] "priority7" "microbiology-services"
#> [35] "mythbuster" "ab"
url <- "https://publichealthmatters.blog.gov.uk"
authors <- get_blog_authors(url)
authors
#> [1] "duncan-selbie" "kate-folkard"
#> [3] "dominik-zenner" "binta-hanakuka"
#> [5] "brian-ferguson" "peter-kelly"
#> [7] "local-authority-public-health" "martin-dockrell"
#> [9] "ann-marie-connolly" "jane-south"
#> [11] "carla-brown" "catherine-hayes"
#> [13] "eamonn-omoore" "jo-peden"
#> [15] "justin-varney" "alison-tedstone"
#> [17] "martin-ward-platt" "jude-stansfield"
#> [19] "mary-bythell" "jamie-waterall"
#> [21] "jennifer-lloyd" "leoni-belsman"
#> [23] "rosanna-oconnor" "dave-jephson"
#> [25] "justine-fitzpatrick" "clare-griffiths"
#> [27] "eustace-de-sousa" "claire-robson"
#> [29] "emma-seria-walker" "linda-hindle"
#> [31] "helen-christmas" "charlotte-fellows"
#> [33] "james-westwood" "john-newton"
#> [35] "blog-editor" "diarmaid-crean"
#> [37] "jane-leaman" "vanessa-saliba"
#> [39] "lucy-elliss-brookes" "ravi-jaipaul"
#> [41] "paul-cosford" "lucy-young"
#> [43] "liz-price" "jem-rashbass"
#> [45] "ruth-swann" "richard-gleave"
#> [47] "yvonne-doyle" "valerie-delpech"
#> [49] "louis-levy" "sema-mandal"
#> [51] "anthony-nardone" "diane-ashiru-oredope"
#> [53] "clare-heaviside" "dame-carol-black"
#> [55] "shaun-donaghy" "jim-mcmanus"
#> [57] "sandra-white" "paul-johnstone"
#> [59] "helen-garnham" "gregor-henderson"
#> [61] "russ-moody" "david-green"
#> [63] "10" "sacha-wyke"
#> [65] "allan-baker" "sonia-gill"
#> [67] "marilena-korkodilos" "jenny-harries"
#> [69] "matthew-day" "nino-manddalena"
#> [71] "sarah-smith" "tim-gant"
#> [73] "tony-vickers-byrne" "anne-mackie"
#> [75] "christine-mccartney" "simon-thomas"
#> [77] "chloe-johnson" "claire-laurent"
#> [79] "08" "ruth-stubbs"
#> [81] "mohammed-vaqar" "helen-harris"
#> [83] "alison-keating" "faith-ege"
#> [85] "mel-sirotkin" "jamie-blackshaw"
#> [87] "sophie-newbound" "michael-brodie"
#> [89] "joanne-yarwood" "john-ohagan"
#> [91] "elizabeth-castle" "tim-chadborn"
#> [93] "kevin-fenton" "leonie-prasad"
#> [95] "panos-zerdevas" "sarah-anderson"
#> [97] "clare-perkins" "derren-ready"
#> [99] "david-pencheon" "viv-bennett"
#> [101] "craig-wright" "maxine-johnson"
#> [103] "polly-kwok" "sabrina-safo"
#> [105] "mary-e-black" "lucy-irvine"
#> [107] "john-robson" "henrik-moller"
#> [109] "katherine-henson" "nuzhat-ali"
#> [111] "owen-brigstock-barron" "gul-root"
#> [113] "natasha-roberts" "ruth-simmons"
#> [115] "sean-mcphail" "stephen-morton"
#> [117] "nicholas-coyle" "wendy-nicholson"
#> [119] "elaine-rashbrook" "charles-alessi"
#> [121] "mike-gent" "geraldine-strathdee"
#> [123] "karen-shaw" "virginia-murray"
#> [125] "shona-arora" "chris-hatton"
#> [127] "matt-kearney" "james-charnock"
#> [129] "carl-petrokofsky" "matt-hennessey"
#> [131] "paul-crook" "angie-bone"
#> [133] "rebecca-hams" "jill-meara"
#> [135] "paul-hamblin" "darryl-quantz"
#> [137] "jenifer-smith" "alison-hadley"
#> [139] "bernd-eggen" "mary"
#> [141] "richard-pebody" "steve-owens"
#> [143] "anthony-kessel" "henry-kippin"
#> [145] "dr-sotiris-vardoulakis" "charlotte-mcclymont"
#> [147] "luke-hounsome" "charlotte-eley"
#> [149] "vicki-young" "lorraine-oldridge"
#> [151] "jolyon-medlock" "alexander-vaux"
#> [153] "01" "sani-dimitroulopoulou"
#> [155] "lily-makurah" "susan-hopkins"
#> [157] "jim-obrien" "peter-bradley"
#> [159] "chris-carrigan" "michael-heasman"
#> [161] "david-madden" "helen-duncan"
#> [163] "julian-flowers" "kayleigh-hansford"
#> [165] "matthew-dryden" "sonia-roschnik"
#> [167] "julia-verne" "jenny-godson"
#> [169] "julia-csikar" "katie-carmichael"
#> [171] "catherine-heffernan" "joanne-bosanquet"
#> [173] "sue-ibbotson" "sarah-stevens"
#> [175] "neil-mccoll" "neil-bentley"
#> [177] "ann-tonks" "judith-rankin"
#> [179] "brian-mccloskey" "dr-jo-whaley"
#> [181] "emma-aarons" "ibrahim-abubakar"
#> [183] "melvina-woode-owusu" "julietta-patnick"
#> [185] "naomi-holman" "jane-anderson"
#> [187] "ann-hoskins" "susie-singleton"
#> [189] "mahesh-patel" "sophie"
#> [191] "the-mythbuster" "christine-harvey"
#> [193] "david-rhodes" "mark-salter"
#> [195] "meng-khaw" "claire-currie"
#> [197] "geraldine-oliver" "don-sinclair"
#> [199] "helen-shaw" "hilary-kirkbride"
#> [201] "sally-warren" "isabel-oliver"
#> [203] "alex-elliot" "heather-lodge"
#> [205] "caroline-hancock" "hywell-dinsdale"
#> [207] "andrew-cooper" "rashmi-shukla"
#> [209] "jonathan-marron" "nick-phin"
#> [211] "muir-gray"                     "david-heymann"

We can extract blog posts for any given category, e.g. the-week-at-phe:
n <- 6
url <- "https://publichealthmatters.blog.gov.uk/category/the-week-at-phe/"
url1 <- paste0(url, "/page/", 2:n)
urls <- if (n > 1) c(url, url1) else url
#cat <- "local-authority-public-health/"
links <- map(urls, function(x) get_blog_links(x))
links1 <- flatten(links) %>% unique() %>% map_chr(1)
links1 <- links1[-11]
blogs <- map(links1, function(x) get_blog_text(x))
blogs1 <- map(blogs, data.frame)
blogs2 <- map_df(blogs1, bind_rows)
colnames(blogs2) <- c("text", "url")
palette <- rev(RColorBrewer::brewer.pal(10, "Spectral"))
blogs2 %>%
create_bigrams(text) %>%
group_by(bigram) %>%
count(sort=TRUE) %>%
filter(!bigram %in% c("public health", "we’re pleased", "past week", "phe’s online", "week here’s", "online activity")) %>%
with(., wordcloud::wordcloud(bigram, n, colors = palette, random.order = FALSE, random.color = FALSE, max.words = Inf, rot.per = 0.4, scale = c(4, 0.1)))

…or Duncan Selbie's Friday message:
n <- 6
url <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/"
url1 <- paste0(url, "/page/", 2:n)
urls <- if (n > 1) c(url, url1) else url
#cat <- "local-authority-public-health/"
links <- map(urls, function(x) get_blog_links(x))
links1 <- flatten(links) %>% unique() %>% map_chr(1)
links1 <- links1[-11]
blogs <- map(links1, function(x) get_blog_text(x))
blogs1 <- map(blogs, data.frame)
blogs2 <- map_df(blogs1, bind_rows)
colnames(blogs2) <- c("text", "url")
palette <- RColorBrewer::brewer.pal(11, "Spectral")
blogs2 %>%
create_bigrams(text) %>%
group_by(bigram) %>%
count(sort=TRUE) %>%
ungroup() %>%
mutate(bigram = tm::removeNumbers(bigram)) %>%
filter(!bigram %in% c("public health", "we’re pleased", "past week", "phe’s online", "week here’s", "online activity", "friday messages", "wishes friday")) %>%
with(., wordcloud::wordcloud(bigram, n, colors = palette, random.order = FALSE, random.color = FALSE, max.words = Inf, rot.per = 0.4, scale = c(4, 0.1)))

library(igraph)
library(ggraph)
bigrams <- create_bigrams(blogs2, text)
bigrams <- bigrams %>%
separate(url, remove = FALSE, c("root", "name", "cat", "message", "day", "month", "year" ), sep = "-") %>%
mutate(year = str_replace_all(year, "/", "")) %>%
unite(date, day, month, year, sep ="-") %>%
select(date, bigram, n)
big_count <- bigrams %>%
group_by(date, bigram) %>%
count(sort = TRUE) %>%
#separate(bigram, c("word1", "word2")) %>%
filter(n >1)
big_graph <- big_count %>%
graph_from_data_frame()
ggraph(big_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n)) +
geom_node_point(color = "green") +
geom_node_text(aes(label = name), size = rel(3), alpha = 0.7, colour = "black", vjust = 1, hjust = 1)+
theme_void() +
theme(legend.position = "bottom") +
labs(title = "Network plot of Friday messages")

library(quanteda)
library(tm)
blog_corp <- quanteda::corpus(blogs2$text)
dfm_phe <- quanteda::dfm(blog_corp, remove = c(stopwords("en"), "government", "phe", "health", "uploads", "harlow", "nhs", "england", "the_nhs", "phe_harlow", "attachment_data", "victoria", "london", "swh", "eu"), remove_punct = TRUE, remove_numbers = TRUE, tolower = TRUE, ngrams = 1:2, context = "window")
feat <- names(topfeatures(dfm_phe, 300))
dfm_select(dfm_phe, feat) %>% textplot_network(min_freq = 0.9, omit_isolated = TRUE, vertex_labelfont = "Arial Narrow") +
labs(title = "Network map of most frequent terms used in Friday messages",
subtitle = "1 and 2 word ngrams")

It is hard to get everything PHE has published on GOV.UK. To assist this process we have written a function which produces an interactive table of all PHE publications by category (NB at the moment it is over-inclusive). This makes use of the DT package, which allows us to add download options so the data can be exported in various formats.